{ "cells": [ { "cell_type": "markdown", "id": "213d6cca", "metadata": {}, "source": [ "# Adapting and Extending GB-BI\n", "\n", "**In this notebook, we will show how to implement novel fitness functions, representations and acquisition functions.**\n", "\n", "Welcome to our notebook on adapting and extending GB-BI! Here, we introduce a fresh perspective on the GB-BI codebase by implementing alternative fitness functions, molecular representations, and acquisition functions. Throughout this notebook, we focus on the practical aspects of adapting the GB-BI code, providing concrete examples through the creation of new classes for these key components. So, if you're eager to learn how to enhance GB-BI's capabilities or want to know how to adapt the code for your own purposes, you're in the right place." ] }, { "cell_type": "markdown", "id": "13a79b88", "metadata": {}, "source": [ "## Defining a New Fitness Function\n", "\n", "The most easily adaptable component of GB-BI's internal functionalities is the fitness function. In the ../argenomic/functions/fitness.py file, you can find several fitness functions, including those used in the paper. Below, we show an `Abstract_Fitness` class, which highlights how all of the fitness function classes are designed. Essentially, only the `fitness_function` method needs to be implemented to capture the fitness function you want to use. Optionally, this might include creating helper functions and a more involved use of the config file. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "856e6d07", "metadata": {}, "outputs": [], "source": [ "class Abstract_Fitness:\n", " \"\"\"\n", " A strategy class for calculating the fitness of a molecule.\n", " \"\"\"\n", " def __init__(self, config) -> None:\n", " self.config = config\n", " return None\n", "\n", " def __call__(self, molecule) -> None:\n", " \"\"\"\n", " Updates the fitness value of a molecule.\n", " \"\"\"\n", " molecule.fitness = self.fitness_function(molecule)\n", " return molecule\n", " \n", " @abstractmethod\n", " def fitness_function(self, molecule) -> float:\n", " raise NotImplementedError" ] }, { "cell_type": "markdown", "id": "c0972f9f", "metadata": {}, "source": [ "For example, we will be implementing the benchmark objective for the design of organic photovoltaics from Tartarus benchmark suite. We load the power conversion efficiency class `pce` from the Tartarus library, based on the Scharber model, and apply it to the SMILES of molecules presented to the `Power_Conversion_Fitness` class. Note that the fitness function includes a penalty based on the synthetic accessibility score (sas).\n", "\n", "References:\n", "- Nigam, AkshatKumar, et al. [Tartarus: A benchmarking platform for realistic and practical inverse molecular design](https://arxiv.org/pdf/2209.12487.pdf). Advances in Neural Information Processing Systems 36 (2024).\n", "- Alharbi, Fahhad H., et al. [An efficient descriptor model for designing materials for solar cells](https://www.nature.com/articles/npjcompumats20153). 
npj Computational Materials 1.1 (2015): 1-9.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ff158a5c", "metadata": {}, "outputs": [], "source": [ "from tartarus import pce\n", "\n", "class Power_Conversion_Fitness:\n", "    \"\"\"\n", "    A strategy class for calculating the power conversion efficiency fitness of a molecule.\n", "    \"\"\"\n", "    def __init__(self, config) -> None:\n", "        self.config = config\n", "        return None\n", "\n", "    def __call__(self, molecule):\n", "        \"\"\"\n", "        Updates the fitness value of a molecule and returns the molecule.\n", "        \"\"\"\n", "        molecule.fitness = self.fitness_function(molecule)\n", "        return molecule\n", "\n", "    def fitness_function(self, molecule) -> float:\n", "        dipole, hl_gap, lumo, obj, pce_1, pce_2, sas = pce.get_properties(molecule.smiles)\n", "        # Penalise the power conversion efficiency with the synthetic accessibility score.\n", "        return (pce_1 - sas)" ] }, { "cell_type": "markdown", "id": "ba432cb9", "metadata": {}, "source": [ "Finally, don't forget to add the newly designed fitness function class to the `Fitness` class in the ../argenomic/mechanism.py file, as shown below, to make it available in the configuration file. " ] }, { "cell_type": "code", "execution_count": null, "id": "9aee10c6", "metadata": {}, "outputs": [], "source": [ "from argenomic.functions.fitness import Power_Conversion_Fitness\n", "\n", "class Fitness:\n", "    @staticmethod\n", "    def __new__(self, config):\n", "        match config.type:\n", "            case \"Fingerprint\":\n", "                return Fingerprint_Fitness(config)\n", "            case \"USRCAT\":\n", "                return USRCAT_Fitness(config)\n", "            case \"Zernike\":\n", "                return Zernike_Fitness(config)\n", "            case \"PCE\":\n", "                return Power_Conversion_Fitness(config)\n", "            case _:\n", "                raise ValueError(f\"{config.type} is not a supported fitness function type.\")" ] }, { "cell_type": "markdown", "id": "00ba6b8d", "metadata": {}, "source": [ "## Defining a New Molecular Representation\n", "\n", "New molecular representations can readily be added to GB-BI. 
The process is somewhat more involved than adding a new fitness function, due to the large variety of potential molecular representations and the peculiarities of their original implementations. In the ../argenomic/functions/surrogate.py file, you can find the `GP_Surrogate` class, which contains all the necessary functionality to apply the Tanimoto kernel from GAUCHE to the representations that are calculated in the `calculate_encodings` method. Because in some cases (e.g. bag-of-words, SELFIES) the representations need to be determined over the combined list of novel and previously seen molecules, there is a separate `add_to_prior_data` method, which adds the novel molecules and their fitness values to the memory of the class and recalculates the encodings. " ] }, { "cell_type": "code", "execution_count": null, "id": "dfd34a7d", "metadata": {}, "outputs": [], "source": [ "from abc import abstractmethod\n", "\n", "class Abstract_Surrogate(GP_Surrogate):\n", "    \"\"\"\n", "    A strategy class for calculating the encodings of molecules for the surrogate model.\n", "    \"\"\"\n", "    def __init__(self, config):\n", "        super().__init__(config)\n", "        return None\n", "\n", "    @abstractmethod\n", "    def add_to_prior_data(self, molecules):\n", "        raise NotImplementedError\n", "\n", "    @abstractmethod\n", "    def calculate_encodings(self, molecules):\n", "        raise NotImplementedError" ] }, { "cell_type": "markdown", "id": "a848af99", "metadata": {}, "source": [ "As an example, we will implement the Avalon fingerprint as a representation for the surrogate GP model. Note that, for use in the Tanimoto kernel, it is important to return NumPy array versions of the fingerprint vectors from the `calculate_encodings` method. Because there are no complications in calculating fingerprint representations, the `add_to_prior_data` method simply appends the encodings and fitness values of the new molecules to the `self.encodings` and `self.fitnesses` variables. \n", "\n", "References:\n", "- Griffiths, Ryan-Rhys, et al. 
\"Gauche: A library for Gaussian processes in chemistry.\" Advances in Neural Information Processing Systems 36 (2024). \n", "- Gedeck, Peter, Bernhard Rohde, and Christian Bartels. \"QSAR− how good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets.\" Journal of chemical information and modeling 46.5 (2006): 1924-1936." ] }, { "cell_type": "code", "execution_count": null, "id": "18a397d0", "metadata": {}, "outputs": [], "source": [ "from rdkit.Avalon import pyAvalonTools\n", "\n", "class Avalon_Surrogate(GP_Surrogate):\n", " def __init__(self, config):\n", " super().__init__(config)\n", " self.bits = self.config.bits\n", " \n", " def add_to_prior_data(self, molecules):\n", " \"\"\"\n", " Updates the prior data for the surrogate model with new molecules and their fitness values.\n", " \"\"\"\n", " if self.encodings is not None and self.fitnesses is not None:\n", " self.encodings = np.append(self.encodings, self.calculate_encodings(molecules), axis=0)\n", " self.fitnesses = np.append(self.fitnesses, np.array([molecule.fitness for molecule in molecules]), axis=None)\n", " else:\n", " self.encodings = self.calculate_encodings(molecules)\n", " self.fitnesses = np.array([molecule.fitness for molecule in molecules])\n", " return None\n", "\n", " def calculate_encodings(self, molecules):\n", " molecular_graphs = [Chem.MolFromSmiles(Chem.CanonSmiles(molecule.smiles)) for molecule in molecules]\n", " return np.array([pyAvalonTools.GetAvalonFP(molecular_graph, self.bits) for molecular_graph in molecular_graphs]).astype(np.float64)" ] }, { "cell_type": "markdown", "id": "e74bcc1f", "metadata": {}, "source": [ "Once again, at the end of this process, it's necessary to add the newly designed surrogate function as an option in the `Surrogate` class in the ../argenomic/mechanism.py file to make it available in the configuration file. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "4afe469e", "metadata": {}, "outputs": [], "source": [ "from argenomic.functions.surrogate import Avalon_Surrogate\n", "\n", "class Surrogate:\n", " @staticmethod\n", " def __new__(self, config):\n", " match config.type:\n", " case \"String\":\n", " return String_Surrogate(config)\n", " case \"Fingerprint\":\n", " return Fingerprint_Surrogate(config)\n", " case \"Fingerprint\":\n", " return Avalon_Surrogate(config)\n", " case _:\n", " raise ValueError(f\"{config.type} is not a supported surrogate function type.\")" ] }, { "cell_type": "markdown", "id": "97f34aff", "metadata": { "tags": [] }, "source": [ "## Defining a New Acquisition Function\n", "\n", "A new type of acquisition function can be added to GB-BI in a manner highly similar to adding a new molecular representation. In the ../argenomic/functions/acquisition.py file, you can find the `BO_Acquisition` parent class that encapsulates all the necessary logic to apply Bayesian optimisation to the quality-diversity archive of GB-BI. A novel acquisition function is hence simply made by creating a class that inherits from this class and implements `calculate_acquisition_value` method. Note that the parent class has direct a link to the archive, so the current fitness value of the molecule in the relevant niche can be accessed if necessary." ] }, { "cell_type": "code", "execution_count": 15, "id": "0826fcff", "metadata": {}, "outputs": [], "source": [ "class Abstract_Acquisition(BO_Acquisition):\n", " \"\"\"\n", " A strategy class for the posterior mean of a list of molecules.\n", " \"\"\"\n", " @abstractmethod\n", " def calculate_acquisition_value(self, molecules) -> None:\n", " raise NotImplementedError" ] }, { "cell_type": "markdown", "id": "a889afab", "metadata": {}, "source": [ "To show how this process works, we implement the probability of improvement as a novel acquisition function. 
First, we inherit from the `BO_Acquisition` class, and then we fill in the `calculate_acquisition_value` method with the relevant logic. Note that the archive is accessed directly to read out the fitness of the current occupant of the niche to which the candidate molecule is assigned. Empty niches have a fitness value equal to zero. \n", "\n", "References:\n", "- Verhellen, Jonas, and Jeriek Van den Abeele. \"Illuminating elite patches of chemical space.\" Chemical science 11.42 (2020): 11485-11491. \n", "- Kushner, Harold J. \"A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise.\" (1964): 97-106.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3a1d9344", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "class Probability_Of_Improvement(BO_Acquisition):\n", "    \"\"\"\n", "    A strategy class for the probability of improvement of a list of molecules.\n", "    \"\"\"\n", "    def calculate_acquisition_value(self, molecules):\n", "        \"\"\"\n", "        Updates the acquisition value for a list of molecules and returns the molecules.\n", "        \"\"\"\n", "        current_fitnesses = [self.archive.elites[molecule.niche_index].fitness for molecule in molecules]\n", "        for molecule, current_fitness in zip(molecules, current_fitnesses):\n", "            # Standardised predicted improvement over the current occupant of the niche.\n", "            Z = (molecule.predicted_fitness - current_fitness) / molecule.predicted_uncertainty\n", "            molecule.acquisition_value = norm.cdf(Z)\n", "        return molecules" ] }, { "cell_type": "markdown", "id": "f5e15cc6", "metadata": {}, "source": [ "As a final step, remember to add the new acquisition function class to the `Acquisition` factory class in the ../argenomic/mechanism.py file, as shown here."
] }, { "cell_type": "code", "execution_count": null, "id": "8b468b98", "metadata": {}, "outputs": [], "source": [ "from argenomic.functions.acquisition import Probability_Of_Improvement\n", "\n", "class Acquisition:\n", "    @staticmethod\n", "    def __new__(self, config):\n", "        match config.type:\n", "            case 'Mean':\n", "                return Posterior_Mean(config)\n", "            case 'UCB':\n", "                return Upper_Confidence_Bound(config)\n", "            case 'EI':\n", "                return Expected_Improvement(config)\n", "            case 'logEI':\n", "                return Log_Expected_Improvement(config)\n", "            case 'PI':\n", "                return Probability_Of_Improvement(config)\n", "            case _:\n", "                raise ValueError(f\"{config.type} is not a supported acquisition function type.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }